Skip to content

Conversation

@JadenFiotto-Kaufman
Copy link
Member

No description provided.

dtype: str | torch.dtype = "bfloat16"


@ray.remote(num_cpus=2, num_gpus=0, max_restarts=-1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Do we want to have num_cpus hardcoded?
  2. Would it necessarily be bad to have a finite max_restarts value (e.g 5)? What happens if we didn't have it set? It might make it easier to catch silent failures if the actor ends up eventually dying

def purge(self):

for deployment in self.deployments.values():
deployment.delete()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like deployment.delete() shouldn't handle exceptions itself. From a robustness standpoint, I think it's probably important for the node to know if an attempt to delete something failed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.) What do you mean by "know" 2.) When would we need to know?

Copy link
Member

@MichaelRipa MichaelRipa Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Be aware whether an exception was raised in deployment.delete()
  2. When calling deployment.delete() raises an exception


del self.nodes[node]
node = self.nodes.pop(node_id)
node.purge()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to what I commented regarding deployment.delete(), I think any exceptions that happen in purge should propagate up here, and at the very least be logged as an error before proceeding

@JadenFiotto-Kaufman JadenFiotto-Kaufman merged commit d776634 into dev Nov 13, 2025
@JadenFiotto-Kaufman JadenFiotto-Kaufman deleted the refactor-deployments branch November 13, 2025 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants